Comparing inverted files and signature files for searching a large lexicon
Authors
Abstract
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
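The signature-file idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the n-gram length `N`, the signature width `WIDTH`, the toy hash in `bit_for`, and the function names are all assumptions chosen for illustration, and patterns are assumed to use only `*` as a wildcard. Each n-gram sets exactly one bit in a term's signature, the signatures are stored bit-sliced (one bit vector per signature position), and candidates are verified against the pattern because signatures can produce false matches.

```python
from fnmatch import fnmatchcase

N = 2        # n-gram length (assumed; the abstract does not fix n)
WIDTH = 64   # signature width in bits (assumed)


def ngrams(s, n=N):
    """Return the set of n-grams of s (empty if s is shorter than n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def bit_for(gram, width=WIDTH):
    """Map an n-gram to exactly one bit position with a deterministic toy hash."""
    h = 0
    for ch in gram:
        h = (h * 131 + ord(ch)) % width
    return h


def build_slices(lexicon, width=WIDTH):
    """Bit-sliced layout: slices[b] is an int whose bit t is set when
    term number t has bit b set in its signature."""
    slices = [0] * width
    for t, term in enumerate(lexicon):
        for g in ngrams(term):
            slices[bit_for(g, width)] |= 1 << t
    return slices


def search(pattern, lexicon, slices, width=WIDTH):
    """Return lexicon terms matching a pattern that uses '*' wildcards."""
    # Only n-grams lying wholly inside a literal fragment are usable.
    grams = set()
    for fragment in pattern.split("*"):
        grams |= ngrams(fragment)
    # AND together the slices selected by the query n-grams.
    candidates = (1 << len(lexicon)) - 1
    for g in grams:
        candidates &= slices[bit_for(g, width)]
    # Different n-grams may share a bit, so verify every candidate.
    return [term for t, term in enumerate(lexicon)
            if (candidates >> t) & 1 and fnmatchcase(term, pattern)]
```

Calling `build_slices` once on the lexicon and then `search("sig*", lexicon, slices)` returns the matching terms in lexicon order. Shrinking `WIDTH` makes the index smaller but lets more n-grams collide on the same bit, so more false candidates reach the verification step; that is the size-versus-speed trade-off the abstract describes.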
Similar resources
Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files
There are many advantages to be gained by storing the lexicon of a full text database in main memory. In this paper we describe how to use a compressed inverted file index to search such a lexicon for entries that match a pattern or partially specified term. This method provides an effective compromise between speed and space, running orders of magnitude faster than brute force search, but requ...
Fast Text Access Methods for Optical and Large Magnetic Disks: Designs and Performance Comparison
High capacity disks, especially optical ones, are commercially available. These disks are ideal for archiving large text data bases. In this work, we examine efficient searching techniques for such applications. We propose a unifying framework, which reveals the similarities between signature files and an inverted file using a hash table. Then, we design methods that combine the ease of inserti...
CLIP: A Compact, Load-balancing Index Placement Function
Existing file searching tools do not have the performance or accuracy that search engines have. This is especially a problem in large-scale distributed file systems, where better-performing file searching tools are much needed for enterprise-level systems. Search engines use inverted indices to store terms and other metadata. Although some desktop file searching tools use indices to store file ...
Combining Pat-Trees and Signature Files for Query Evaluation in Document Databases
In this paper, a new indexing technique to support the query evaluation in document databases is proposed. The key idea of the method is the combination of the technique of pat-trees with signature files. While the signature files are built to expedite the traversal of object hierarchies, the pat-trees are constructed to speed up both the signature file searching and the text scanning. In this ...
Journal title: Inf. Process. Manage.
Volume 41
Publication date: 2005